Goto

Collaborating Authors

 task descriptor


Aligning Validation with Deployment: Target-Weighted Cross-Validation for Spatial Prediction

Brenning, Alexander, Suesse, Thomas

arXiv.org Machine Learning

Cross-validation (CV) is commonly used to estimate predictive risk when independent test data are unavailable. Its validity depends on the assumption that validation tasks are sampled from the same distribution as prediction tasks encountered during deployment. In spatial prediction and other settings with structured data, this assumption is frequently violated, leading to biased estimates of deployment risk. We propose Target-Weighted CV (TWCV), an estimator of deployment risk that accounts for discrepancies between validation and deployment task distributions, thus accounting for (1) covariate shift and (2) task-difficulty shift. We characterize prediction tasks by descriptors such as covariates and spatial configuration. TWCV assigns weights to validation losses such that the weighted empirical distribution of validation tasks matches the corresponding distribution over a target domain. The weights are obtained via calibration weighting, yielding an importance-weighted estimator that targets deployment risk. Since TWCV requires adequate coverage of the deployment distribution's support, we combine it with spatially buffered resampling that diversifies the task difficulty distribution. In a simulation study, conventional as well as spatial estimators exhibit substantial bias depending on sampling, whereas buffered TWCV remains approximately unbiased across scenarios. A case study in environmental pollution mapping further confirms that discrepancies between validation and deployment task distributions can affect performance assessment, and that buffered TWCV better reflects the prediction task over the target domain. These results establish task distribution mismatch as a primary source of CV bias in spatial prediction and show that calibration weighting combined with a suitable validation task generator provides a viable approach to estimating predictive risk under dataset shift.





FlashMoE: Fast Distributed MoE in a Single Kernel

Aimuyo, Osayamen Jonathan, Oh, Byungsoo, Singh, Rachee

arXiv.org Artificial Intelligence

The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashMoE eliminates bulk-synchronous collectives for one-sided, device-initiated, inter-GPU (R)DMA transfers, thereby unlocking payload efficiency by eliminating bloated or redundant network payloads in sparsely activated layers. When evaluated on an 8-H100 GPU node with MoE models comprising up to 128 experts and 16K token sequences, FlashMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency compared to state-of-the-art baselines, despite using FP32, whereas the baselines use FP16. FlashMoE shows that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML. We provide code at https://github.com/osayamenja/FlashMoE.


A with Gaussian processes

Neural Information Processing Systems

This section details how P AML can be combined with Gaussian processes, as in our experiments. Alternatively, one can use other probabilistic methods, e.g., Bayesian Neural Networks [1]. Secondly, it enables mini-batch training for further improvement in computational efficiency. During the evaluation, we compute the errors with respect to the normalized outputs, since the observed environments' state representations include dimensions of differing We use control signals that alternate back and forth from one end of the range to the other to generate trajectories. This policy resulted in better coverage of the state-space, compared to a random walk.



Automating Curriculum Learning for Reinforcement Learning using a Skill-Based Bayesian Network

Hsiao, Vincent, Roberts, Mark, Hiatt, Laura M., Konidaris, George, Nau, Dana

arXiv.org Artificial Intelligence

A major challenge for reinforcement learning is automatically generating curricula to reduce training time or improve performance in some target task. We introduce SEBNs (Skill-Environment Bayesian Networks) which model a probabilistic relationship between a set of skills, a set of goals that relate to the reward structure, and a set of environment features to predict policy performance on (possibly unseen) tasks. We develop an algorithm that uses the inferred estimates of agent success from SEBN to weigh the possible next tasks by expected improvement. We evaluate the benefit of the resulting curriculum on three environments: a discrete gridworld, continuous control, and simulated robotics. The results show that curricula constructed using SEBN frequently outperform other baselines.


AIDE: AI-Driven Exploration in the Space of Code

Jiang, Zhengyao, Schmidt, Dominik, Srikanth, Dhruv, Xu, Dixing, Kaplan, Ian, Jacenko, Deniss, Wu, Yuxiang

arXiv.org Artificial Intelligence

Machine learning, the foundation of modern artificial intelligence, has driven innovations that have fundamentally transformed the world. Yet, behind advancements lies a complex and often tedious process requiring labor and compute intensive iteration and experimentation. Engineers and scientists developing machine learning models spend much of their time on trial-and-error tasks instead of conceptualizing innovative solutions or research hypotheses. To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine learning engineering agent powered by large language models (LLMs). AIDE frames machine learning engineering as a code optimization problem, and formulates trial-anderror as a tree search in the space of potential solutions. By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI's MLE-Bench and METR's RE-Bench. The implementation of AIDE is publicly available at https://github.com/WecoAI/aideml.


Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor

Liskavets, Barys, Roy, Shuvendu, Ushakov, Maxim, Klibanov, Mark, Etemad, Ali, Luke, Shane

arXiv.org Artificial Intelligence

The rise of Large Language Models (LLMs) has led to significant interest in prompt compression, a technique aimed at reducing the length of input prompts while preserving critical information. However, the prominent approaches in prompt compression often require explicit questions or handcrafted templates for compression, limiting their generalizability. We propose Task-agnostic Prompt Compression (TPC), a novel framework that generalizes compression across tasks and domains without requiring input questions or templates. TPC generates a context-relevant task description using a task descriptor trained on a curated dataset of context and query pairs, and fine-tuned via reinforcement learning with a reward function designed to capture the most relevant information. The task descriptor is then utilized to compute the relevance of each sentence in the prompt to generate the compressed prompt. We introduce 3 model sizes (Base, Large, and Huge), where the largest model outperforms the existing state-of-the-art methods on LongBench and ZeroSCROLLS benchmarks, and our smallest model performs comparable to the existing solutions while being considerably smaller.